Data: 1000 restaurants for each city

Cuisines: most popular (bar chart)
Chains: Rating and # franchises (bokeh)
Distributions of features
- 2D distributions of features
Cuisines: price vs rating (bokeh... see next notebook)

TODO

Scatter plot: average rating and cost for each cuisine, cuisine at least N samples
Which cities have the greatest concentration of mexican, ethiopian, etc.
Determine restaurants with single category and then look for relationships between category and price, rating, review count, etc.
Explore data using a bokeh plot in http://localhost:8889/notebooks/examples/app/movies/Untitled.ipynb
- This is a plot I could have on my webpage (not a dashboard) !!
What are the most popular and least popular?
Which cities are nicest, with the best restaurants? (may be sampling bias; maybe should sort the search alphabetically?)
Which cities are cheapest?
plot poke on maps
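
For the first TODO item (average rating and cost for each cuisine, restricted to cuisines with at least N samples), one possible approach is sketched here on toy data. The frames `restaurants` and `categories` are hypothetical stand-ins for `df_restaurants` and `df_categories`, and the sketch assumes that row i of both frames describes the same restaurant, which should be checked against the real files.

```python
import pandas as pd

# Toy stand-ins for df_restaurants and df_categories (hypothetical values).
# Assumption: row i of both frames describes the same restaurant.
restaurants = pd.DataFrame({
    'rating': [4.5, 4.0, 3.0, 5.0, 2.5, 3.5],
    'cost':   [2, 1, 1, 3, 1, 2],
})
categories = pd.DataFrame({
    'mexican':   [1, 1, 0, 0, 1, 0],
    'ethiopian': [0, 0, 1, 0, 0, 0],
    'pizza':     [0, 1, 0, 1, 0, 1],
})

min_samples = 2  # hypothetical threshold N

records = []
for cuisine in categories.columns:
    mask = categories[cuisine].astype(bool)
    n = int(mask.sum())
    if n < min_samples:
        continue  # skip cuisines with too few restaurants
    records.append({'cuisine': cuisine,
                    'n': n,
                    'mean_rating': restaurants.loc[mask, 'rating'].mean(),
                    'mean_cost': restaurants.loc[mask, 'cost'].mean()})

cuisine_summary = pd.DataFrame(records).set_index('cuisine')
print(cuisine_summary)
```

The resulting `mean_cost` and `mean_rating` columns are what the scatter plot would show, one point per cuisine.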

Notes

Per capita analysis may not be valid because Yelp searches around a city, not just within the area where the population was counted
- e.g. a South San Francisco search on Yelp likely brings up restaurants outside the area whose population was counted
- This could be mitigated by instead assigning restaurants to the city given in their address
Categories - might be overlapping
- I should look through the top 100 and manually collapse some (deli and sandwich; japanese and sushi) where one is a subset of another
Bokeh sometimes stops working in its exported HTML when multiple plots are made in one notebook
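
As a concrete picture of the per capita calculation the first note is wary of, here is a minimal sketch using the `population` and `total_food` values of the first three cities in `city_pop.csv`. The column name `food_per_1000` is made up for illustration, and the caveat above still applies: `total_food` comes from a search around the city, so the ratio is only a rough indicator.

```python
import pandas as pd

# Values copied from the first rows of the city table in this notebook
cities = pd.DataFrame({
    'city': ['New York', 'Los Angeles', 'Chicago'],
    'population': [8537673, 3976322, 2704958],
    'total_food': [54191, 41685, 19315],
})

# Food businesses per 1000 residents (hypothetical column name)
cities['food_per_1000'] = 1000 * cities['total_food'] / cities['population']
print(cities.sort_values('food_per_1000', ascending=False))
```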



In [1]:

    
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import glob
import os
import scipy as sp
from scipy import stats

from tools.plt import color2d #from the 'srcole/tools' repo
from matplotlib import cm

Load dataframes



In [2]:

    
# Load cities info
df_cities = pd.read_csv('/gh/data2/yelp/city_pop.csv', index_col=0)
df_cities.head()









    Out[2]:

   city         state       population  total_food  latitude   longitude    total_scraped
0  New York     New York    8537673     54191       40.705445  -73.994293   1000
1  Los Angeles  California  3976322     41685       34.061590  -118.321381  1000
2  Chicago      Illinois    2704958     19315       41.905159  -87.677765   1000
3  Houston      Texas       2303482     15197       29.784854  -95.359955   1000
4  Phoenix      Arizona     1615017     11034       33.465086  -112.070160  1000



In [3]:

    
# Load restaurants
df_restaurants = pd.read_csv('/gh/data2/yelp/food_by_city/df_restaurants.csv', index_col=0)
df_restaurants.head()









    Out[3]:

   id                                  name                     city      state     rating  review_count  cost  latitude   longitude   has_delivery  has_pickup  url
0  poquito-picante-brooklyn-2          Poquito Picante          New York  New York  4.5     40            2     40.685742  -73.981262  True          True        https://www.yelp.com/biz/poquito-picante-brook...
1  nourish-brooklyn-4                  Nourish                  New York  New York  4.0     65            2     40.677960  -73.968550  True          True        https://www.yelp.com/biz/nourish-brooklyn-4?ad...
2  taste-of-heaven-brooklyn            Taste of Heaven          New York  New York  5.0     19            2     40.717150  -73.940540  False         True        https://www.yelp.com/biz/taste-of-heaven-brook...
3  milk-and-cream-cereal-bar-new-york  Milk & Cream Cereal Bar  New York  New York  4.5     307           2     40.719580  -73.996540  False         False       https://www.yelp.com/biz/milk-and-cream-cereal...
4  the-bao-shoppe-new-york-2           The Bao Shoppe           New York  New York  4.0     99            1     40.714345  -73.990518  False         False       https://www.yelp.com/biz/the-bao-shoppe-new-yo...



In [9]:

    
# Load categories by restaurant
df_categories = pd.read_csv('/gh/data2/yelp/food_by_city/df_categories.csv', index_col=0)
df_categories.head()









    Out[9]:

   acaibowls  accessories  active  acupuncture  adultedu  advertising  aerialfitness  afghani  african  airport_shuttles  ...  wine_bars  wineries  winetasteclasses  winetastingroom  winetours  womenscloth  wraps  yelpevents  yoga  zoos
0  0          0            0       0            0         0            0              0        0        0                 ...  0          0         0                 0                0          0            0      0           0     0
1  0          0            0       0            0         0            0              0        0        0                 ...  0          0         0                 0                0          0            0      0           0     0
2  0          0            0       0            0         0            0              0        0        0                 ...  0          0         0                 0                0          0            0      0           0     0
3  0          0            0       0            0         0            0              0        0        0                 ...  0          0         0                 0                0          0            0      0           0     0
4  0          0            0       0            0         0            0              0        0        0                 ...  0          0         0                 0                0          0            0      0           0     0

5 rows × 684 columns

1. What are most popular categories?



In [10]:

    
# Manually concatenate categories with at least 500 counts
# Find categories D and V such that category 'D' should be counted as category 'V'
category_subsets = {'delis': 'sandwiches',
                    'sushi': 'japanese',
                    'icecream': 'desserts',
                    'cafes': 'coffee',
                    'sportsbars': 'bars',
                    'hotdog': 'hotdogs',
                    'wine_bars': 'bars',
                    'pubs': 'bars',
                    'cocktailbars': 'bars',
                    'beerbar': 'bars',
                    'tacos': 'mexican',
                    'gastropubs': 'bars',
                    'ramen': 'japanese',
                    'chocolate': 'desserts',
                    'dimsum': 'chinese',
                    'cantonese': 'chinese',
                    'szechuan': 'chinese',
                    'coffeeroasteries': 'coffee',
                    'hookah_bars': 'bars',
                    'irish_pubs': 'bars'}

for k in category_subsets.keys():
    df_categories[category_subsets[k]] = np.logical_or(df_categories[k], df_categories[category_subsets[k]])
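
To see why the merge uses `np.logical_or` rather than adding the columns, here is a small sketch on a toy indicator table. The frame `cats` and its values are hypothetical; only the merging rule mirrors the loop above.

```python
import numpy as np
import pandas as pd

# Toy indicator table (hypothetical): one row per restaurant
cats = pd.DataFrame({
    'sushi':    [1, 0, 0, 1],
    'ramen':    [0, 1, 0, 0],
    'japanese': [0, 0, 1, 1],
})
subsets = {'sushi': 'japanese', 'ramen': 'japanese'}

# Naive alternative: summing the columns counts the last restaurant twice
naive = cats[['sushi', 'ramen', 'japanese']].sum(axis=1)

# Logical OR keeps each restaurant as a single member of the parent category
for child, parent in subsets.items():
    cats[parent] = np.logical_or(cats[child], cats[parent])

print(naive.tolist())
print(cats['japanese'].tolist())
```

A restaurant tagged both `sushi` and `japanese` therefore contributes one count to `japanese`, not two. The merged column becomes boolean, which still sums correctly.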



In [11]:

    
# Remove some categories
category_remove = ['hotdog', 'cafes']
for k in category_remove:
    df_categories.drop(k, axis=1, inplace=True)



In [12]:

    
# Top categories
N = 20
category_counts = df_categories.sum().sort_values(ascending=False)
top_N_categories = list(category_counts.head(N).keys())
top_N_categories_counts = category_counts.head(N).values

category_counts.head(N)









    Out[12]:





sandwiches          73219.0
mexican             65826.0
hotdogs             63692.0
bars                55038.0
tradamerican        52656.0
pizza               50188.0
burgers             48277.0
coffee              42829.0
breakfast_brunch    39025.0
newamerican         31026.0
desserts            29916.0
chinese             28560.0
italian             28380.0
seafood             26280.0
japanese            24984.0
grocery             22727.0
salad               22291.0
bakeries            20171.0
foodtrucks          19650.0
chicken_wings       17667.0
dtype: float64



In [13]:

    
# Bar chart
plt.figure(figsize=(12,5))
plt.bar(np.arange(N), top_N_categories_counts / len(df_restaurants), color='k', edgecolor='.5')
plt.xticks(np.arange(N), top_N_categories)
plt.ylabel('Fraction of restaurants', size=20)
plt.xlabel('Restaurant category', size=20)
plt.xticks(size=15, rotation='vertical')
plt.yticks(size=15);

2. What are the most common restaurant chains?



In [5]:

    
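# Group by restaurant name and average only the numeric columns of interest
# (selecting the columns first avoids averaging string columns such as 'url')
gb = df_restaurants.groupby('name')
df_chains = gb[['rating', 'review_count', 'cost']].mean()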
df_chains['count'] = gb.size()
df_chains.sort_values('count', ascending=False, inplace=True)
df_chains.head(10)









    Out[5]:

name            rating    review_count  cost      count
Subway          2.922554  5.474838      1.815844  8167
McDonald's      2.182025  12.400315     1.425018  5708
Starbucks       3.321750  23.568826     1.735268  4446
Taco Bell       2.620842  11.928064     1.300256  3517
Wendy's         2.492046  9.558568      1.435647  2766
Burger King     2.241244  7.554769      1.561103  2684
Dunkin' Donuts  2.719626  9.031598      1.387628  2247
Walgreens       2.964069  4.881941      1.994867  2143
Domino's Pizza  2.943810  13.044762     1.875714  2100
Pizza Hut       2.286355  9.845618      2.078187  2008
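
The per-chain table can also be built in a single call with pandas named aggregation (available in pandas 0.25 and later). The sketch below uses a toy frame with hypothetical values; the column names match `df_restaurants`, but nothing here is taken from the real data.

```python
import pandas as pd

# Toy restaurant table (hypothetical values)
toy = pd.DataFrame({
    'name':         ['Subway', 'Subway', 'Starbucks', 'Local Diner', 'Subway'],
    'rating':       [3.0, 2.5, 3.5, 4.5, 3.5],
    'review_count': [5, 7, 20, 150, 6],
    'cost':         [2, 2, 2, 1, 1],
})

# One row per name: mean rating, mean review count, mean cost, number of locations
chains = (toy.groupby('name')
             .agg(rating=('rating', 'mean'),
                  review_count=('review_count', 'mean'),
                  cost=('cost', 'mean'),
                  count=('rating', 'size'))
             .sort_values('count', ascending=False))
print(chains)
```

Grouping by name alone treats any two businesses with the same name as one chain, so unrelated restaurants that happen to share a name would be pooled together.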

2a. Correlations in chain properties

Higher rating --> more reviews, fewer branches
More branches --> fewer reviews
'U'-shaped relationship between cost and review count: chains whose branches are consistently cost 1 or consistently cost 2 get lots of reviews, but chains whose branches are inconsistent do not
- true for min_count = 40, 50, 100, 200
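
The relationships above are measured with Spearman correlations, which depend only on the ordering of the values. As a minimal sketch on hypothetical numbers, Spearman's rho is the Pearson correlation of the ranks, which is why it suits heavy-tailed quantities such as the number of branches.

```python
import numpy as np
import pandas as pd

# Toy chain table (hypothetical): better-rated chains have fewer branches
toy = pd.DataFrame({
    'rating': [2.2, 2.9, 3.3, 3.9, 4.4],
    'count':  [5000, 800, 900, 120, 60],
})

# Spearman's rho = Pearson correlation computed on the ranks
ranks = toy.rank()
rho = np.corrcoef(ranks['rating'], ranks['count'])[0, 1]

# Same quantity from pandas directly
rho_pandas = toy.corr(method='spearman').loc['rating', 'count']
print(rho, rho_pandas)
```

Because only ranks enter, taking the log of `count` would leave rho unchanged, so the log axes used in the plots do not affect the reported correlations.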



In [6]:

    
# Only consider restaurants with at least 50 locations
min_count = 50
df_temp = df_chains[df_chains['count'] >= min_count]

plt.figure(figsize=(8,12))
plt_num = 1
for i, k1 in enumerate(df_temp.keys()):
    for j, k2 in enumerate(df_temp.keys()[i+1:]):
        if k1 in ['review_count', 'count']:
            if k2 in ['review_count', 'count']:
                plot_f = plt.loglog
            else:
                plot_f = plt.semilogx
        else:
            if k2 in ['review_count', 'count']:
                plot_f = plt.semilogy
            else:
                plot_f = plt.plot
        plt.subplot(3, 2, plt_num)
        plot_f(df_temp[k1], df_temp[k2], 'k.')
        plt.xlabel(k1)
        plt.ylabel(k2)
        plt_num += 1
        r, p = stats.spearmanr(df_temp[k1], df_temp[k2])
        plt.title(r)
plt.tight_layout()

2b. Number of franchises vs rating (bokeh)



In [28]:

    
from bokeh.io import output_notebook
from bokeh.layouts import row, widgetbox
from bokeh.models import CustomJS, Slider, Legend, HoverTool
from bokeh.plotting import figure, output_file, show, ColumnDataSource

output_notebook()

# Slider variables
min_N_franchises = 100

# Determine dataframe sources
df_chains2 = df_chains[df_chains['count'] > 10].reset_index()
df_temp = df_chains2[df_chains2['count'] >= min_N_franchises]

# Create data source for plotting and Slider callback
source1 = ColumnDataSource(df_temp, id='source1')
source2 = ColumnDataSource(df_chains2, id='source2')

hover = HoverTool(tooltips=[
    ("Name", "@name"),
    ("Avg Stars", "@rating"),
    ("# locations", "@count")])

# Make initial figure of average rating vs number of locations
plot = figure(plot_width=400, plot_height=400,
              x_axis_label='Number of locations',
              y_axis_label='Average rating',
              x_axis_type="log", tools=[hover])

plot.scatter('count', 'rating', source=source1, line_width=3, line_alpha=0.6, line_color='black')

# Declare how to update plot on slider change
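callback = CustomJS(args=dict(s1=source1, s2=source2), code="""
    // Rebuild the plotted source (s1) from the full source (s2),
    // keeping only chains with at least the slider's number of locations
    var d1 = s1.data;
    var d2 = s2.data;
    var min_count = N.value;
    // Reset every column that is plotted or shown in the hover tooltip
    d1["count"] = [];
    d1["rating"] = [];
    d1["name"] = [];
    for (var i = 0; i < d2["count"].length; i++) {
        if (d2["count"][i] >= min_count) {
            d1["count"].push(d2["count"][i]);
            d1["rating"].push(d2["rating"][i]);
            d1["name"].push(d2["name"][i]);
        }
    }

    s1.change.emit();
""")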

N_slider = Slider(start=10, end=1000, value=min_N_franchises, step=10,
                  title="minimum number of franchises", callback=callback)
callback.args["N"] = N_slider

# Define layout of plot and sliders
layout = row(plot, widgetbox(N_slider))

# Output and show
output_file("/gh/srcole.github.io/assets/misc/yelp_bokeh.html", title="Yelp WIP")
show(layout)









    





    
        
E-1001 (BAD_COLUMN_NAME): Glyph refers to nonexistent column name: cost, rating [renderer: GlyphRenderer(id='34516048-4a35-4a7c-b102-686c869fe5a4', ...)]
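
The slider callback amounts to a threshold filter that has to be applied to every column at once, so that the names, ratings, and counts stay aligned. A minimal pandas sketch of the same logic follows; `filter_chains` is a hypothetical helper and the chain values are made up.

```python
import pandas as pd

# Toy chain table (hypothetical values)
chains = pd.DataFrame({
    'name':   ['A', 'B', 'C', 'D'],
    'rating': [2.9, 3.3, 4.1, 3.8],
    'count':  [8167, 450, 95, 12],
})

def filter_chains(df, min_locations):
    """Return the rows with at least min_locations locations, all columns together."""
    return df[df['count'] >= min_locations].reset_index(drop=True)

visible = filter_chains(chains, 100)
print(visible)
```

Filtering whole rows this way keeps all columns the same length, which is the property the plotted data source needs after each slider move.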

3. Distributions of ratings, review counts, and costs

3a. Distributions



In [13]:

    
N_bins_per_factor10 = 8
bins_by_key = {'rating': np.arange(0.75, 5.75, .5),
               'review_count': np.logspace(1, 5, num=N_bins_per_factor10*4+1),
               'cost': np.arange(.5, 5, 1)}
log_by_key = {'rating': False,
               'review_count': True,
               'cost': False}

plt.figure(figsize=(12, 4))
for i, k in enumerate(bins_by_key.keys()):
    weights = np.ones_like(df_restaurants[k].values)/float(len(df_restaurants[k].values))
    plt.subplot(1, 3, i+1)
    plt.hist(df_restaurants[k].values, bins_by_key[k], log=log_by_key[k],
             color='k', edgecolor='.5', weights=weights)
    if k == 'review_count':
        plt.semilogx(1,1)
        plt.xlim((10, 40000))
    elif i == 0:
        plt.ylabel('Probability')
    plt.xlabel(k)
plt.tight_layout()
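
Two details of the histogram cell are easy to miss: the review-count bins are log-spaced with a fixed number of bins per factor of 10, and the weights turn counts into probabilities. The sketch below reproduces both on hypothetical review counts. It also shows that `np.histogram` drops values outside the bin range, so restaurants with fewer than 10 reviews would not be counted and the bars would sum to less than 1.

```python
import numpy as np

bins_per_decade = 8
# 4 decades from 10 to 100000, so 4 * 8 bins and 33 edges
edges = np.logspace(1, 5, num=bins_per_decade * 4 + 1)

# Hypothetical review counts, all inside the bin range
review_counts = np.array([12, 40, 65, 99, 307, 1500, 20000])
weights = np.ones(len(review_counts)) / len(review_counts)
prob, _ = np.histogram(review_counts, bins=edges, weights=weights)
print(prob.sum())

# One value (3 reviews) lies below the first edge and is silently dropped
with_small = np.array([3, 12, 40])
prob_small, _ = np.histogram(with_small, bins=edges,
                             weights=np.ones(len(with_small)) / len(with_small))
print(prob_small.sum())
```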

3b. Correlations (histograms)



In [14]:

    
# Prepare histogram analysis
gb_cost = df_restaurants.groupby('cost').groups
gb_rating = df_restaurants.groupby('rating').groups

# Remove 0 from gb_rating
gb_rating.pop(0.0)

N_bins_cost = len(gb_cost.keys())
N_bins_count = len(bins_by_key['review_count']) - 1
N_bins_rate = len(bins_by_key['rating']) - 1

# Hist: review count and rating as fn of cost
hist_count_by_cost = np.zeros((N_bins_cost, N_bins_count))
hist_rate_by_cost = np.zeros((N_bins_cost, N_bins_rate))
points_count_by_cost = np.zeros((N_bins_cost, 3))
points_rate_by_cost = np.zeros((N_bins_cost, 3))
for i, k in enumerate(gb_cost.keys()):
    # Make histogram of review count as fn of cost
    x = df_restaurants.loc[gb_cost[k]]['review_count'].values
    hist_temp, _ = np.histogram(x, bins=bins_by_key['review_count'])
    # Make each cost sum to 1
    hist_count_by_cost[i] = hist_temp / np.sum(hist_temp)
    # Compute mean and st. dev. (third column: st. dev. capped at 5 - mean)
    points_count_by_cost[i,0] = np.mean(x)
    points_count_by_cost[i,1] = np.std(x)
    points_count_by_cost[i,2] = np.min([np.std(x), 5-np.mean(x)])
    
    # Repeat for rating
    x = df_restaurants.loc[gb_cost[k]]['rating'].values
    hist_temp, _ = np.histogram(x, bins=bins_by_key['rating'])
    hist_rate_by_cost[i] = hist_temp / np.sum(hist_temp)
    points_rate_by_cost[i,0] = np.mean(x)
    points_rate_by_cost[i,1] = np.std(x)
    points_rate_by_cost[i,2] = np.min([np.std(x), 5-np.mean(x)])
    
# Make histograms of review count as fn of rating
hist_count_by_rate = np.zeros((N_bins_rate, N_bins_count))
points_count_by_rate = np.zeros((N_bins_rate, 3))
for i, k in enumerate(gb_rating.keys()):
    # Make histogram of review count as fn of rating
    x = df_restaurants.loc[gb_rating[k]]['review_count'].values
    hist_temp, _ = np.histogram(x, bins=bins_by_key['review_count'])
    # Make each rating sum to 1
    hist_count_by_rate[i] = hist_temp / np.sum(hist_temp)
    points_count_by_rate[i,0] = np.mean(x)
    points_count_by_rate[i,1] = np.std(x)
    points_count_by_rate[i,2] = np.min([np.std(x), 5-np.mean(x)])
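
Each row of these arrays is a conditional distribution, for example P(rating | cost), because every row is divided by its own total. For two discrete columns the same table can be sketched with `pd.crosstab`; the toy values below are hypothetical, and this is an alternative to the loop above rather than what the notebook runs.

```python
import pandas as pd

# Toy data (hypothetical): cost level and star rating per restaurant
toy = pd.DataFrame({
    'cost':   [1, 1, 1, 2, 2, 3],
    'rating': [3.0, 3.5, 3.5, 4.0, 4.5, 4.5],
})

# normalize='index' divides each row by its sum, giving P(rating | cost)
cond = pd.crosstab(toy['cost'], toy['rating'], normalize='index')
print(cond)
```

Normalizing per row makes the cost levels comparable even though far fewer restaurants fall in the expensive categories.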



In [15]:

    
# Make a 2d colorplot
plt.figure(figsize=(10,4))
color2d(hist_rate_by_cost, cmap=cm.viridis,
        clim=[0,.4], cticks = np.arange(0,.41,.05), color_label='Probability',
        plot_xlabel='Rating', plot_ylabel='Cost ($)',
        plot_xticks_locs=range(N_bins_rate), plot_xticks_labels=gb_rating.keys(),
        plot_yticks_locs=range(N_bins_cost), plot_yticks_labels=gb_cost.keys(),
        interpolation='none', fontsize_minor=14, fontsize_major=19)

# On top, plot the mean and st. dev.
# plt.errorbar(points_rate_by_cost[:,0] / , np.arange(N_bins_cost), fmt='.', color='w', ms=10,
#              xerr=points_rate_by_cost[:,1:].T, ecolor='w', alpha=.5)



In [16]:

    
# Make a 2d colorplot
xbins_label = np.arange(0,N_bins_per_factor10*2+1, N_bins_per_factor10)
plt.figure(figsize=(10,4))
color2d(hist_count_by_cost, cmap=cm.viridis,
        clim=[0,.2], cticks = np.arange(0,.21,.05), color_label='Probability',
        plot_xlabel='Number of reviews', plot_ylabel='Cost ($)',
        plot_xticks_locs=xbins_label, plot_xticks_labels=bins_by_key['review_count'][xbins_label].astype(int),
        plot_yticks_locs=range(N_bins_cost), plot_yticks_labels=gb_cost.keys(),
        interpolation='none', fontsize_minor=14, fontsize_major=19)
plt.xlim((-.5,N_bins_per_factor10*2 + .5))









    Out[16]:





(-0.5, 16.5)



In [17]:

    
# Make a 2d colorplot
xbins_label = np.arange(0,N_bins_per_factor10*2+1, N_bins_per_factor10)
plt.figure(figsize=(10,6))
color2d(hist_count_by_rate, cmap=cm.viridis,
        clim=[0,.4], cticks = np.arange(0,.41,.1), color_label='Probability',
        plot_xlabel='Number of reviews', plot_ylabel='Rating',
        plot_xticks_locs=xbins_label, plot_xticks_labels=bins_by_key['review_count'][xbins_label].astype(int),
        plot_yticks_locs=range(N_bins_rate), plot_yticks_labels=gb_rating.keys(),
        interpolation='none', fontsize_minor=14, fontsize_major=19)
plt.xlim((-.5,N_bins_per_factor10*2 + .5))









    Out[17]:





(-0.5, 16.5)
